In this blog post, we'll have a look at the Kaggle What's Cooking data challenge.
This competition is all about predicting which cuisine a recipe belongs to, given a list of its ingredients. For example, assume you have a recipe that reads:
plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, milk, vegetable oil
Can you guess from which cuisine this is? As this recipe comes from the training set of this challenge, I can tell you that the expected answer is Southern US.
Without further ado, let's dive in. First, let us have a look at the training data.
Exploring the training data
We'll use pandas to go through the data. First, let's read the json file containing the recipes and the cuisines:
import pandas as pd
df_train = pd.read_json('train.json')
Let's look at the head of the data:
df_train.head()
We can see the structure of the data: a cuisine, an id for the recipe, and the ingredients.
As a first step, let's look at the cuisines in the dataset. How many cuisines are there, and how many recipes does each one have?
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df_train['cuisine'].value_counts().plot(kind='bar')
As can be seen in this figure, there are a lot of Italian, Mexican and Southern US recipes, and noticeably fewer of the other cuisines.
To get a little insight into the data itself, we can look at a couple of recipes. In particular, we can count the most frequent ingredients for each cuisine. To do that, we can use Python Counter objects (found in the collections module of the standard library).
from collections import Counter
counters = {}
for cuisine in df_train['cuisine'].unique():
    counters[cuisine] = Counter()
    indices = (df_train['cuisine'] == cuisine)
    for ingredients in df_train[indices]['ingredients']:
        counters[cuisine].update(ingredients)
Let's look at a result:
counters['italian'].most_common(10)
We can easily convert the top 10 ingredients for each cuisine to a separate dataframe for nicer viewing:
top10 = pd.DataFrame([[items[0] for items in counters[cuisine].most_common(10)] for cuisine in counters],
                     index=[cuisine for cuisine in counters],
                     columns=['top{}'.format(i) for i in range(1, 11)])
top10
An even better visualisation would be to use images instead of words in this table. We can do this by exporting the previous table to HTML and replacing the ingredient names with HTML image tags for the selected ingredients. This is done using regular expression matching, with each image base64-encoded directly into the source.
import re
import base64

def repl(m):
    ingredient = m.groups()[0]
    image_path = 'img/' + ingredient + '.png'
    with open(image_path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read())
    return '<td><img width=100 src="data:image/png;base64,{}"></td>'.format(encoded_string.decode('utf-8'))
table_with_images = re.sub("<td>([ \-\w]+)</td>", repl, top10.to_html())
We can easily display this HTML output in our notebook:
from IPython.display import HTML
HTML(table_with_images)
This visualization allows us to determine a couple of things. For instance, we can see that the top1 ingredient for each cuisine is a salty one. This already lets us group the cuisines:
- salt is the standard for most cuisines
- soy sauce is number one for Chinese, Japanese and Korean cuisines
- fish sauce is number one for Thai and Vietnamese cuisines
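This grouping can be checked programmatically with the counters built above; here is a minimal, self-contained sketch (the toy counts below are made up for illustration):

```python
from collections import Counter

# Toy per-cuisine ingredient counts, standing in for the real counters dict
toy_counters = {
    'italian': Counter({'salt': 10, 'olive oil': 8, 'garlic': 6}),
    'chinese': Counter({'soy sauce': 12, 'salt': 7}),
    'thai': Counter({'fish sauce': 9, 'garlic': 5}),
}

# The single most common ingredient per cuisine
top1 = {cuisine: counter.most_common(1)[0][0]
        for cuisine, counter in toy_counters.items()}
print(top1)
```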
Another thing that is easily seen from this table is that many ingredients appear under more than one name:
- garlic cloves, garlic
- olive oil, extra-virgin olive oil
- ...
Judging from this table, it would be interesting to see which of the top 10 ingredients are highly specific to a certain cuisine. A simple way to do this is to count the number of times an ingredient appears in a given cuisine and divide by the total number of recipes in that cuisine.
To do this, we first create a new column in our dataframe by simply concatenating the ingredients into a single string:
df_train['all_ingredients'] = df_train['ingredients'].map(";".join)
df_train.head()
We can now take advantage of the powerful string processing functions of pandas to check for the presence of an ingredient in a recipe:
df_train['all_ingredients'].str.contains('garlic cloves')
This can be used to group our recipes by the presence of that ingredient:
indices = df_train['all_ingredients'].str.contains('garlic cloves')
df_train[indices]['cuisine'].value_counts().plot(kind='bar',
                                                 title='garlic cloves as found per cuisine')
However, we have to keep in mind that there are a lot of Italian recipes in our database, so it's appropriate to divide by the number of recipes per cuisine before presenting the result:
relative_freq = (df_train[indices]['cuisine'].value_counts() / df_train['cuisine'].value_counts())
relative_freq = relative_freq.sort_values()
relative_freq.plot(kind='bar')
This way of looking at the data lets us see which countries use garlic cloves a lot in the recipes found in the database. As expected, Mediterranean and Asian cuisines are at the top, and British is at the bottom.
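The division above works because pandas aligns the two value_counts Series on their index before dividing; cuisines without any matching recipe come out as NaN. A small self-contained sketch with made-up counts:

```python
import pandas as pd

# Made-up counts: recipes containing the ingredient vs. total recipes per cuisine
matches = pd.Series({'italian': 30, 'british': 2})
totals = pd.Series({'italian': 100, 'british': 10, 'thai': 50})

# Division aligns on the index; 'thai' has no matches, so its ratio is NaN
relative_freq = (matches / totals).sort_values()
print(relative_freq)
```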
We can do this sort of plot for all top 10 ingredients. First let's determine the unique ingredients:
import numpy as np
unique = np.unique(top10.values.ravel())
unique
Turns out we can fit this in an 8 by 8 subplot grid:
fig, axes = plt.subplots(8, 8, figsize=(20, 20))
for ingredient, ax_index in zip(unique, range(64)):
    indices = df_train['all_ingredients'].str.contains(ingredient)
    relative_freq = (df_train[indices]['cuisine'].value_counts() / df_train['cuisine'].value_counts())
    relative_freq.plot(kind='bar', ax=axes.ravel()[ax_index], fontsize=7, title=ingredient)
The previous diagram, even if it's not very clear, allows us to spot ingredients which have a high degree of uniqueness. Among them, I'd list:
- soy sauce (Asian cuisines)
- sake (Japanese)
- sesame oil (Asian cuisines)
- feta cheese crumbs (Greek)
- garam masala (Indian)
- ground ginger (Moroccan)
- avocado (Mexican)
Others are quite common:
- salt
- oil
- pepper
- sugar
This nicely concludes our data exploration. At the same time, it allows us to form a little intuition about how we could categorize a recipe's cuisine based on the ingredients:
- are there highly specific ingredients in the recipe that clearly point it to a given country?
In the next section, we will train a logistic regression classifier on the data we have analyzed so far and look at the results.
Training a logistic regression classifier
We will use scikit-learn to perform our classification. First, we will need to encode our features to a matrix that the machine learning algorithms in scikit learn can use. This is done using a count vectorizer:
from sklearn.feature_extraction.text import CountVectorizer
We can conveniently tell the count vectorizer which features it should accept and let it build, in a single step, a matrix whose entries mark which ingredient tokens are present in each recipe:
cv = CountVectorizer()
X = cv.fit_transform(df_train['all_ingredients'].values)
We can check the shape of that matrix:
X.shape
We see that the vectorizer has retained 3010 features and processed the 40 000 recipes in the training dataset. We can easily access the features using the vectorizer's vocabulary_ attribute (a dictionary mapping each token to its column index):
print(list(cv.vocabulary_.keys())[:100])
Each feature gets assigned a column number, and the corresponding entry is nonzero when the ingredient token is present in a recipe.
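To make this encoding concrete, here is a minimal sketch on two made-up "recipes". Note that the default tokenizer splits on non-word characters, so a multi-word ingredient such as "olive oil" is encoded as two separate tokens:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["salt;olive oil;garlic", "soy sauce;salt;ginger"]
toy_cv = CountVectorizer()
X_toy = toy_cv.fit_transform(docs)

# Columns are sorted alphabetically: garlic, ginger, oil, olive, salt, sauce, soy
print(sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get))
print(X_toy.toarray())
```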
Now that we have our feature matrix, we still need to encode the labels that represent the cuisine of each recipe. This is done with a label encoder:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(df_train.cuisine)
The variable y is now a vector with numbers instead of strings for each cuisine:
y[:100]
We can check the result by inspecting the encoder's classes:
enc.classes_
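The encoder sorts the class names alphabetically and can map the numbers back to strings with inverse_transform; a minimal self-contained sketch:

```python
from sklearn.preprocessing import LabelEncoder

toy_enc = LabelEncoder()
labels = ['italian', 'mexican', 'italian', 'thai']
codes = toy_enc.fit_transform(labels)

print(codes)                             # integer codes, alphabetical class order
print(toy_enc.inverse_transform(codes))  # back to the original strings
```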
Let's now train a logistic regression on the dataset. We'll split the dataset so that we can also test our classifier on data it hasn't seen before:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Now, let's train a logistic regression:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
logistic.fit(X_train, y_train)
We can evaluate our classifier on our test set:
logistic.score(X_test, y_test)
It turns out it performs quite nicely, with a 78% accuracy.
However, this doesn't tell the whole story about what's happening. Let's inspect the classification results using a confusion matrix.
Inspecting the classification results using a confusion matrix
A confusion matrix lets us see where the classifier gets confused. With the row-wise normalization used below, it should be read row by row: each row corresponds to a true cuisine, and the cells along that row show how the recipes of that cuisine were distributed over the predicted cuisines. The colour of each square indicates that fraction.
from sklearn.metrics import confusion_matrix
plt.figure(figsize=(10, 10))
cm = confusion_matrix(y_test, logistic.predict(X_test))
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.imshow(cm_normalized, interpolation='nearest')
plt.title("confusion matrix")
plt.colorbar(shrink=0.3)
cuisines = enc.classes_  # confusion_matrix orders labels as the encoder does (sorted)
tick_marks = np.arange(len(cuisines))
plt.xticks(tick_marks, cuisines, rotation=90)
plt.yticks(tick_marks, cuisines)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
Here, we see that some cuisines are predicted really well (Moroccan, Thai, Indian) while others suffer from confusion (Greek is often predicted as other cuisines, and the same goes for Irish).
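Since each row of the normalized matrix sums to one, its diagonal directly gives the per-cuisine recall, i.e. the fraction of recipes of each true cuisine that were classified correctly. A small self-contained sketch with a made-up 2 by 2 confusion matrix:

```python
import numpy as np

# Made-up confusion matrix for two classes (rows: true, columns: predicted)
cm = np.array([[8, 2],
               [1, 9]])

# Normalize each row by the number of true samples of that class
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

per_class_recall = cm_normalized.diagonal()
print(per_class_recall)  # [0.8, 0.9]
```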
Another way to look at the results is the classification report from scikit-learn:
from sklearn.metrics import classification_report
y_pred = logistic.predict(X_test)
print(classification_report(y_test, y_pred, target_names=enc.classes_))
This allows us to see the standard classification metrics (precision, recall, f1-score) for every cuisine in a single place.
From the previous analyses, we can come up with several ways to improve our classification results.
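One idea suggested by the exploration is normalizing ingredient names, since the same ingredient appears under several names (garlic cloves vs. garlic, olive oil vs. extra-virgin olive oil). A sketch of what such a cleanup step could look like; the modifier list below is illustrative, not exhaustive:

```python
import re

# Modifiers to strip; this list is illustrative only
MODIFIERS = re.compile(r'\b(fresh|ground|chopped|extra-virgin)\b')

def normalize_ingredient(ingredient):
    """Lowercase an ingredient name and strip common modifiers."""
    cleaned = MODIFIERS.sub('', ingredient.lower())
    return ' '.join(cleaned.split())

print(normalize_ingredient('extra-virgin olive oil'))  # olive oil
print(normalize_ingredient('ground black pepper'))     # black pepper
```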
Conclusions
In this post, we've gone through different stages of machine learning: we first explored in depth the data that came with the challenge, then trained a model and analyzed its results. It's not yet clear from these results what we can easily improve in our classification, but they give us quite a lot of information to work with.